
    On the Complexity of Mining Itemsets from the Crowd Using Taxonomies

    We study the problem of frequent itemset mining in domains where data is not recorded in a conventional database but only exists in human knowledge. We provide examples of such scenarios, and present a crowdsourcing model for them. The model uses the crowd as an oracle to find out whether an itemset is frequent or not, and relies on a known taxonomy of the item domain to guide the search for frequent itemsets. In the spirit of data mining with oracles, we analyze the complexity of this problem in terms of (i) crowd complexity, which measures the number of crowd questions required to identify the frequent itemsets; and (ii) computational complexity, which measures the computational effort required to choose the questions. We provide lower and upper complexity bounds in terms of the size and structure of the input taxonomy, as well as the size of a concise description of the output itemsets. We also provide constructive algorithms that achieve the upper bounds, and consider more efficient variants for practical situations. Comment: 18 pages, 2 figures. To be published in ICDT'13. Added missing acknowledgements.
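
    As a rough illustration of the setting (not the paper's algorithm), the sketch below simulates the crowd as a boolean oracle and walks a toy taxonomy from general to specific items, skipping questions whose answers are already implied by monotonicity; the item names and the oracle's answers are entirely hypothetical.

        # Toy sketch, not the paper's algorithm: level-wise search for frequent
        # itemsets with a yes/no "crowd" oracle, using the taxonomy to skip
        # questions whose answer is implied (an infrequent category rules out
        # all of its descendants).
        from itertools import combinations

        # Hypothetical item taxonomy: child -> parent (None marks a root).
        taxonomy = {"beverage": None, "coffee": "beverage", "espresso": "coffee",
                    "tea": "beverage", "food": None, "pastry": "food"}

        def crowd_is_frequent(itemset):
            """Stand-in for one crowd question; a real system asks human workers."""
            answers = {frozenset({"beverage"}), frozenset({"coffee"}),
                       frozenset({"food"}), frozenset({"pastry"}),
                       frozenset({"coffee", "pastry"})}
            return frozenset(itemset) in answers

        def depth(item):
            d = 0
            while taxonomy[item] is not None:
                d, item = d + 1, taxonomy[item]
            return d

        def frequent_itemsets():
            items = sorted(taxonomy, key=depth)      # ask about general items first
            frequent, questions = set(), 0
            for item in items:                       # singletons
                parent = taxonomy[item]
                if parent is not None and frozenset({parent}) not in frequent:
                    continue                         # pruned: ancestor is infrequent
                questions += 1
                if crowd_is_frequent({item}):
                    frequent.add(frozenset({item}))
            singles = sorted(i for s in frequent for i in s)
            for a, b in combinations(singles, 2):    # pairs of frequent singletons
                questions += 1
                if crowd_is_frequent({a, b}):
                    frequent.add(frozenset({a, b}))
            print(f"asked {questions} crowd questions")
            return frequent

        print(frequent_itemsets())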

    On the Limitations of Provenance for Queries With Difference

    The annotation of the results of database transformations was shown to be very effective for various applications. Until recently, most works in this context focused on positive query languages. Provenance semirings are a particular approach that was proven effective for these languages, and it was shown that when propagating provenance with semirings, the expected equivalence axioms of the corresponding query languages are satisfied. There have been several attempts to extend the framework to account for relational algebra queries with difference. We show here that these suggestions fail to satisfy some expected equivalence axioms (that in particular hold for queries on "standard" set and bag databases). Interestingly, we show that this is not a pitfall of these particular attempts, but rather that every such attempt is bound to fail in satisfying these axioms, for some semirings. Finally, we show particular semirings for which an extension for supporting difference is (im)possible. Comment: TAPP 201
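
    For readers unfamiliar with the semiring approach, the toy sketch below (not tied to any specific proposal discussed in the paper) shows how annotations propagate through the positive operators: union adds provenance and join multiplies it; the relations and annotation names are made up. The paper's point is that no analogous uniform choice works for difference.

        # Toy sketch of semiring-style provenance propagation for the positive
        # relational algebra. A relation is a dict: tuple -> symbolic annotation
        # (a provenance polynomial kept as a string).
        R = {("alice", "db"): "r1", ("bob", "ai"): "r2"}
        S = {("db", "icdt"): "s1", ("ai", "ijcai"): "s2", ("db", "vldb"): "s3"}

        def union(r, s):
            out = dict(r)
            for t, a in s.items():
                out[t] = f"({out[t]} + {a})" if t in out else a    # semiring +
            return out

        def join(r, s):
            """Join r's 2nd column with s's 1st column; annotations multiply."""
            out = {}
            for (x, y), a in r.items():
                for (y2, z), b in s.items():
                    if y == y2:
                        t, prod = (x, y, z), f"{a} * {b}"          # semiring *
                        out[t] = f"({out[t]} + {prod})" if t in out else prod
            return out

        print(join(R, S))
        # {('alice', 'db', 'icdt'): 'r1 * s1', ('alice', 'db', 'vldb'): 'r1 * s3',
        #  ('bob', 'ai', 'ijcai'): 'r2 * s2'}
        print(union(R, {("alice", "db"): "r3"}))
        # {('alice', 'db'): '(r1 + r3)', ('bob', 'ai'): 'r2'}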

    Provenance for Aggregate Queries

    In this paper we study provenance information for queries with aggregation. Provenance information was studied in the context of various query languages that do not allow for aggregation, and recent work has suggested capturing provenance by annotating the different database tuples with elements of a commutative semiring and propagating the annotations through query evaluation. We show that aggregate queries pose novel challenges rendering this approach inapplicable. Consequently, we propose a new approach, where we annotate with provenance information not just tuples but also the individual values within tuples, using provenance to describe the computation of these values. We realize this approach in a concrete construction, first for "simple" queries where the aggregation operator is the last one applied, and then for arbitrary (positive) relational algebra queries with aggregation; the latter queries are shown to be more challenging in this context. Finally, we use aggregation to encode queries with difference, and study the semantics obtained for such queries on provenance-annotated databases.
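
    A minimal sketch of the value-level idea (not the paper's actual construction): a SUM aggregate whose result is kept as a formal sum of annotation⊗value terms rather than as a single number; the rows and annotation names below are hypothetical.

        # Toy sketch: annotate individual values, not whole tuples, so the
        # aggregate result records how it was computed.
        # Annotated rows: (group, value, provenance annotation of the source tuple).
        rows = [("dept_a", 100, "t1"), ("dept_a", 250, "t2"), ("dept_b", 80, "t3")]

        def sum_with_provenance(rows):
            terms_by_group = {}
            for group, value, ann in rows:
                terms_by_group.setdefault(group, []).append(f"{ann}⊗{value}")
            # The aggregate "value" is a formal sum over annotated values.
            return {g: " + ".join(terms) for g, terms in terms_by_group.items()}

        print(sum_with_provenance(rows))
        # {'dept_a': 't1⊗100 + t2⊗250', 'dept_b': 't3⊗80'}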

    Top-k Querying of Unknown Values under Order Constraints

    Many practical scenarios make it necessary to evaluate top-k queries over data items with partially unknown values. This paper considers a setting where the values are taken from a numerical domain, and where some partial order constraints are given over known and unknown values: under these constraints, we assume that all possible worlds are equally likely. Our work is the first to propose a principled scheme to derive the value distributions and expected values of unknown items in this setting, with the goal of computing estimated top-k results by interpolating the unknown values from the known ones. We study the complexity of this general task, and show tight complexity bounds, proving that the problem is intractable, but can be tractably approximated. We then consider the case of tree-shaped partial orders, where we show a constructive PTIME solution. We also compare our problem setting to other top-k definitions on uncertain data.
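
    As a brute-force illustration of the setting (not the paper's interpolation scheme), the sketch below estimates expected values of unknown items by rejection-sampling uniform worlds that respect the order constraints, and then ranks items by expected value; the items, constraints, and sample size are hypothetical.

        # Toy sketch: Monte Carlo estimation of expected values of unknown items
        # under order constraints, then an estimated top-k by expected value.
        import random

        known = {"a": 0.9, "d": 0.2}
        unknown = ["b", "c"]
        constraints = [("d", "c"), ("c", "b"), ("b", "a")]   # (x, y): value(x) <= value(y)

        def sample_world():
            world = dict(known, **{u: random.random() for u in unknown})
            return world if all(world[x] <= world[y] for x, y in constraints) else None

        def expected_values(n_samples=100_000):
            sums, count = {u: 0.0 for u in unknown}, 0
            for _ in range(n_samples):
                world = sample_world()
                if world is not None:
                    count += 1
                    for u in unknown:
                        sums[u] += world[u]
            return {u: s / count for u, s in sums.items()}

        def estimated_top_k(k=2):
            estimates = dict(known, **expected_values())
            return sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)[:k]

        print(estimated_top_k())   # roughly [('a', 0.9), ('b', 0.67)]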

    Putting Lipstick on Pig: Enabling Database-Style Workflow Provenance

    Workflow provenance typically assumes that each module is a “black-box”, so that each output depends on all inputs (coarse-grained dependencies). Furthermore, it does not model the internal state of a module, which can change between repeated executions. In practice, however, an output may depend on only a small subset of the inputs (fine-grained dependencies) as well as on the internal state of the module. We present a novel provenance framework that marries database-style and workflow-style provenance, by using Pig Latin to expose the functionality of modules, thus capturing internal state and fine-grained dependencies. A critical ingredient in our solution is the use of a novel form of provenance graph that models module invocations and yields a compact representation of fine-grained workflow provenance. It also enables a number of novel graph transformation operations, allowing users to choose the desired level of granularity in provenance querying (ZoomIn and ZoomOut), and supporting “what-if” workflow analytic queries. We implemented our approach in the Lipstick system and developed a benchmark in support of a systematic performance evaluation. Our results demonstrate the feasibility of tracking and querying fine-grained workflow provenance.
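
    A toy sketch of the zoom-out idea (not the Lipstick system itself): a fine-grained provenance graph whose data nodes are tagged with the module invocation that produced them, and a coarsening step that collapses each invocation into a single node, keeping only cross-module dependencies; node and module names are hypothetical.

        # Toy sketch of coarsening fine-grained workflow provenance.
        # Edges of the fine-grained graph: producer -> consumer.
        fine_edges = [("in1", "x1"), ("in2", "x2"), ("x1", "y1"), ("x2", "y1")]
        # Module invocation each node belongs to ("input" marks workflow inputs).
        module_of = {"in1": "input", "in2": "input",
                     "x1": "clean#1", "x2": "clean#1", "y1": "aggregate#1"}

        def zoom_out(edges, module_of):
            coarse = set()
            for src, dst in edges:
                m_src, m_dst = module_of[src], module_of[dst]
                if m_src != m_dst:               # drop intra-module edges
                    coarse.add((m_src, m_dst))
            return sorted(coarse)

        print(zoom_out(fine_edges, module_of))
        # [('clean#1', 'aggregate#1'), ('input', 'clean#1')]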

    Optimal Probabilistic Generation of XML Documents

    We study the problem of, given a corpus of XML documents and its schema, finding an optimal (generative) probabilistic model, where optimality here means maximizing the likelihood of the particular corpus to be generated. Focusing first on the structure of documents, we present an efficient algorithm for finding the best generative probabilistic model, in the absence of constraints. We further study the problem in the presence of integrity constraints, namely key, inclusion, and domain constraints. We study in this case two different kinds of generators. First, we consider a continuation-test generator that performs, while generating documents, tests of schema satisfiability; these tests prevent generating a document that violates the constraints but, as we will see, they are computationally expensive. We also study a restart generator that may generate an invalid document and, when this is the case, restarts and tries again. Finally, we consider the injection of data values into the structure, to obtain a full XML document. We study different approaches for generating these values.
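
    A minimal sketch of the restart strategy (one of the two generator kinds mentioned above), with a hypothetical toy schema, probabilities, and key constraint: sample a document from the generative model, check the constraint afterwards, and restart on violation.

        # Toy sketch of a "restart" generator for constrained probabilistic
        # document generation.
        import random

        def sample_document():
            # Root has 1-4 item children, geometrically distributed; each item
            # gets an id drawn from a small domain (so key violations can occur).
            n = 1
            while n < 4 and random.random() < 0.5:
                n += 1
            return [{"id": random.choice(["a", "b", "c"])} for _ in range(n)]

        def satisfies_key_constraint(doc):
            ids = [item["id"] for item in doc]
            return len(ids) == len(set(ids))     # "id" is a key: must be unique

        def restart_generator(max_tries=1000):
            for attempt in range(1, max_tries + 1):
                doc = sample_document()
                if satisfies_key_constraint(doc):
                    return doc, attempt
            raise RuntimeError("no valid document generated")

        doc, tries = restart_generator()
        print(f"valid document after {tries} tries: {doc}")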

    Front Matter, Table of Contents, Preface, Conference Organization


    LIPIcs, Volume 98, ICDT'18, Complete Volume
